30 research outputs found

    Improving X10 Program Performances by Clock Removal

    X10 is a promising recent parallel language designed specifically to address the challenges of productively programming a wide variety of target platforms. The sequential core of X10 is an object-oriented language in the Java family. This core is augmented by a few parallel constructs that create activities as a generalization of the well-known fork/join model. Clocks are a generalization of the familiar barriers. Synchronization on a clock is specified by the advance() method call. Activities that execute an advance stall until all existing activities have done the same, and are then released at the same (logical) time. This naturally raises the following question: are clocks strictly necessary for X10 programs? Surprisingly enough, the answer is no, at least for sufficiently regular programs. One assigns a date to each operation, denoting the number of advances that the activity has executed before the operation. Operations with the same date constitute a front; fronts are executed sequentially in order of increasing dates, while operations within a front are executed in parallel if possible. Depending on the nature of the program, this may entail some overhead, which can be reduced to zero for polyhedral programs. We show by experiments that, at least for the current X10 runtime, this transformation usually improves the performance of our benchmarks. Besides its theoretical interest, this transformation may be useful for simplifying a compiler or runtime library.
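
The dating scheme described in this abstract can be illustrated with a minimal sketch. It is not the paper's transformation: activities are modelled here as plain lists of operation names, and the `ADVANCE` marker and the `fronts` helper are hypothetical names introduced only for illustration. Each operation receives a date equal to the number of advances preceding it in its activity, and operations are then grouped into fronts by date.

```python
from collections import defaultdict

# Hypothetical marker standing in for a call to advance() in an activity.
ADVANCE = "advance"

def fronts(activities):
    """Group operations into fronts, ordered by increasing date.

    The date of an operation is the number of ADVANCE markers that
    precede it inside its own activity.
    """
    by_date = defaultdict(list)
    for ops in activities:
        date = 0
        for op in ops:
            if op == ADVANCE:
                date += 1          # crossing a barrier moves to the next date
            else:
                by_date[date].append(op)
    return [by_date[d] for d in sorted(by_date)]

# Two toy activities synchronizing on the same clock.
activities = [
    ["a0", ADVANCE, "a1", ADVANCE, "a2"],
    ["b0", ADVANCE, "b1"],
]
schedule = fronts(activities)
for date, front in enumerate(schedule):
    print(date, front)
```

Fronts are then run one after another, and the operations inside a single front are the candidates for parallel execution.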

    Xfor: Semantics and Performance

    This paper introduces a new programming control structure called "xfor" as an extension of the classical "for" construct in C. It is designed to help programmers improve data locality on multi-core architectures by allowing them to express the schedule of instructions in an abstract way. This schedule is defined geometrically by mapping the iteration domains relative to each other onto a single referential domain, using specific parameters called grain and offset. A semantic framework is presented which associates a precise meaning with this syntactic construct and serves as a basis for applying reliable xfor code transformations and programming strategies. These issues are illustrated with the Red-Black algorithm. Performance measurements carried out on benchmark programs rewritten with the xfor construct show significant execution-time speed-ups.

    A Hermite type adaptive semi-Lagrangian scheme

    We study a new Hermite-type interpolating operator in a semi-Lagrangian scheme for solving the Vlasov equation in the 2D phase space. Numerical results on uniform and adaptive grids are shown and compared with biquadratic Lagrange interpolation in the case of a rotating Gaussian.

    Adaptive 2-D Vlasov Simulation of Particle Beams

    This paper presents our progress on the solution of the 4D Vlasov equation on a grid of the phase space, using two adaptive methods. We briefly recall the principle of the two methods and then focus on computer science aspects, such as data structures and parallelization, for the efficient implementation of the methods. Some relevant numerical results are presented.

    Efficient Data Structures for a Hybrid Parallel and Vectorized Particle-in-Cell Code

    The contribution of the present work is a combination of several optimization techniques for achieving high performance when using automatic vectorization and hybrid MPI/OpenMP parallelism in a Particle-in-Cell (PIC) code. The domain of application is plasma physics: the code simulates 2d2v Vlasov-Poisson systems on Cartesian grids with periodic boundary conditions. Overall, our code processes 65 million particles/second per core on Intel Haswell (without hyper-threading) and achieves good weak scaling up to 0.4 trillion particles on 8,192 cores. The optimizations mainly consist in using (i) a structure of arrays for the particles, (ii) an efficient data structure for the electric field and the charge density, and (iii) code suited to automatic vectorization of the charge accumulation and of the position update. In particular, we use space-filling curves to enhance data locality while enabling vectorization: starting from a redundant cell-based data structure for the electric field and for the charge density, we compare several space-filling curves for an efficient ordering of these data, and we obtain a gain of 36% in the number of L2 and L3 cache misses when using a Morton curve instead of the classical row-major one. In addition, by proposing a specific way of writing the position-update code, we achieve a 31% time improvement in that step. The optimizations bring an overall gain in execution time of 42% with respect to a standard code. The parallelization of the particle loops is simply performed by means of both distributed and shared memory paradigms, without domain decomposition. We explain the weak and strong scaling of the code, which is bounded, as expected, by the overhead of the MPI communications.
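
To make the Morton versus row-major comparison concrete, here is a minimal sketch of the two cell orderings for a 2D grid. It is not taken from the code described in the abstract: the function names, the `bits` parameter, and the choice of which coordinate occupies the low bit are all assumptions made for illustration.

```python
def row_major_index(ix, iy, ncy):
    """Classical row-major position of cell (ix, iy) on a grid with ncy columns."""
    return ix * ncy + iy

def morton_index(ix, iy, bits=16):
    """Morton (Z-order) position of cell (ix, iy), obtained by bit interleaving.

    Bit b of ix goes to bit 2*b of the result and bit b of iy goes to
    bit 2*b + 1 (an arbitrary convention chosen for this sketch).
    """
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (2 * b)
        code |= ((iy >> b) & 1) << (2 * b + 1)
    return code

# Print the Morton position of every cell of a 4 x 4 grid.
for ix in range(4):
    print([morton_index(ix, iy) for iy in range(4)])
```

With the Morton ordering, the four cells of any aligned 2 x 2 block occupy consecutive positions in memory, whereas row-major ordering separates vertically adjacent cells by a full row.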

    Efficient Data Layouts for a Three-Dimensional Electrostatic Particle-in-Cell Code

    The Particle-in-Cell (PIC) method is a widely used tool in plasma physics. To accurately solve realistic problems, the method requires trillions of particles, and there is therefore a strong demand for high-performance code on modern architectures. The present work describes performance results of Pic-Vert, a hybrid OpenMP/MPI and vectorized three-dimensional electrostatic PIC code. The code simulates 3d3v Vlasov-Poisson systems on Cartesian grids with periodic boundary conditions. Overall, it processes 590 million particles/second on a 24-core Intel Skylake architecture, without hyper-threading (25 million particles per second per core). The paper extends our preliminary 2d results to 3d, highlighting the difficulties encountered and the solutions proposed for these extensions. Specifically, our main contributions consist in proposing a new space-filling curve in 3d (called L6D) to improve cache reuse and an adapted loop transformation (strip-mining) to achieve efficient vectorization. The analysis of these optimization strategies is performed in two stages, first on a 24-core socket and second on a supercomputer, from 1 to 3,072 cores, demonstrating significant performance gains and very satisfactory weak scaling of the code.
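
Strip-mining, the loop transformation mentioned in this abstract, splits one loop into an outer loop over fixed-size strips and an inner loop over the elements of each strip. The sketch below only shows the loop structure on a toy summation; the function name and the `strip` size are hypothetical, and Python itself gains no vectorization from it. In a compiled language, the short inner loop with a fixed trip count is the part a compiler can vectorize.

```python
def strip_mined_sum(values, strip=8):
    """Sum `values` with a strip-mined loop nest.

    The outer loop walks over strips of at most `strip` elements and the
    inner loop handles one strip; the last strip may be shorter.
    """
    total = 0.0
    for start in range(0, len(values), strip):
        end = min(start + strip, len(values))   # clamp the final strip
        for i in range(start, end):
            total += values[i]
    return total

result = strip_mined_sum(list(range(10)), strip=4)
print(result)
```

The transformation changes only the iteration structure: every index is still visited exactly once and in the original order.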

    Study and Development of a Control Module for a Numerical Simulation Platform (Étude et développement d'un module de contrôle pour une plate-forme de simulation numérique)

    The aim of this work is to study various solutions for setting up a numerical simulation platform. The platform must be able to bring together several programs developed within the INRIA-CALVI project, whose goal is the mathematical and numerical study and the visualization of various problems arising mainly from plasma physics and particle beams. This technical report presents the development of a control module and proposes an API (Application Programming Interface) to which programs intended to run on this platform must conform. It also presents the scripting language Python, as well as the use of tools that extend its capabilities through compiled languages.

    A Parallel Adaptive Vlasov Solver Based on Hierarchical Finite Element Interpolation

    We present a parallel adaptive scheme for the Vlasov equation. Our method reduces dependencies between data by means of a hierarchical finite element interpolation approach. A specific data distribution pattern yields an efficient implementation. Numerical results are presented for a classical beam simulation in the 1D phase space.

    Dealing with arithmetic overflows in the polyhedral model

    The polyhedral model provides techniques to optimize Static Control Programs (SCoP) using complex transformations which improve data locality and which can expose parallelism. These advanced transformations are now available in both GCC and LLVM. In this paper, we focus on the correctness of these transformations and in particular on the problem of integer overflows. Indeed, the strength of the polyhedral model is to produce an abstract mathematical representation of a loop nest which allows high-level transformations. But this abstract representation is valid only when we ignore the fact that our integers are only machine integers. In this paper, we present a method to deal with this mismatch between the mathematical and concrete representations of loop nests. We assume the existence of polyhedral optimization transformations which are proved to be correct in a world without overflows, and we provide a self-verifying compilation function. Rather than verifying the correctness of this function, we use an approach based on a validator, which is a tool that is run by the compiler after the transformation itself and which confirms that the code produced is equivalent to the original code. Because we aim at a formal proof of the validator, we implement it using the Coq proof assistant as a programming language [4].
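
The mismatch between mathematical and machine integers can be shown with a small sketch. This is not the validator described in the abstract: `wrap32` and `fits32` are hypothetical helpers that model 32-bit two's-complement arithmetic, and the affine expression `2*i + j` is an assumed example of an index that a loop transformation might introduce.

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1

def wrap32(x):
    """Value of x after wrapping to a signed 32-bit two's-complement integer."""
    return (x + 2**31) % 2**32 - 2**31

def fits32(x):
    """True when the mathematical integer x is representable on 32 bits."""
    return INT_MIN <= x <= INT_MAX

# Both i and j fit on 32 bits, but the affine expression 2*i + j does not.
i, j = 2**30, 0
exact = 2 * i + j        # value in the abstract mathematical model
machine = wrap32(exact)  # value obtained if the arithmetic wraps around

print(exact, machine, fits32(exact))
```

Here the mathematical model and the wrapped machine value disagree, which is the kind of divergence that a check run after the transformation would have to rule out before accepting the generated code.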